Guidelines for normalising Early Modern English corpora: Decisions and justifications

Authors

Dawn Archer

Merja Kytö

Alistair Baron

Paul Rayson

Abstract

Corpora of Early Modern English have been collected and released for research for a number of years. With large-scale digitisation activities gathering pace in the last decade, much more historical textual data is now available for research on numerous topics, including historical linguistics and conceptual history. We summarise previous research which has shown that historical spelling variants must be mapped to modern equivalents in order to apply natural language processing and corpus linguistics methods successfully. Manual and semiautomatic methods have been devised to support this normalisation and standardisation process. We argue that a linguistically meaningful rationale is needed to achieve good results from this process. To that end, we propose a number of guidelines for normalising corpora and show how these guidelines have been applied in the Corpus of English Dialogues.

Similar resources

VARD 2: A tool for dealing with spelling variation in historical corpora

Spelling variation causes considerable problems for corpus linguistic techniques such as frequency analysis, concordancing and automatic tagging, with a significant impact being made on recall and the accuracy of results [1]. This paper will focus on Early Modern English, the most recent period of the English language to include a large amount of inconsistent spelling. Although many corpora of ...

Normalising the IJS-ELAN Slovene-English Parallel Corpus for the Extraction of Multilingual Terminology

Various efforts have been made for the development of tools and methods dedicated to the automatic processing of multilingual terminology databases. For that purpose, multilingual parallel corpora have been used as a basis resource. However, most of the neologisms in technical and scientific domains are realised by multiword terms that are rarely identified in parallel corpora. In this paper, w...

Comparative Study of the Academic Vocabulary Content of Electronic Engineering Corpora, GE Materials and M.S. Entrance Examinations

The importance of vocabulary learning has been underlined in the field of English for Academic Purposes (EAP) because non-English majors who require reading English texts in their fields of study have to expand their English vocabulary knowledge much more efficiently than ordinary ESL/EFL learners. Since academic vocabulary instruction in Iranian universities is realized through the use of Gene...

From semi-automatic to automatic affix extraction in Middle English corpora: Building a sustainable database for analyzing derivational morphology over time

The annotation of large corpora is usually restricted to syntactic structure and word class. Pure lexical information and information on the structure of words are stored in specialized dictionaries (Baayen et al., 1995). Both data structures – dictionary and text corpus – can be matched to get e.g. a distribution of certain (restricted) lexical information from a text. This procedure works fin...

Move Structures in “Statement-of-the-Problem” Sections of M.A. Theses: The Case of Native and Nonnative Speakers of English

Understanding how to structure the “Statement-of-the-Problem” (SP) section of a thesis is necessary for EFL students to develop a logical argumentation for a problem statement. This study intended to compare Move structures of SP sections of theses written by native speakers of Persian (NSPs) and English (NSEs). To this end, 100 SP sections (50 SP sections written by NSE...

Publication date: 2015
